# LifeStateBench

**LifeStateBench** is a benchmark for evaluating how well large language models retain and reason about long-term state information.
This repository contains all datasets and evaluation scripts, including new datasets created by Claude.

## 📚 Datasets

This repository includes the following datasets. Each dataset contains:

- **Full script**
- **Role cards** describing key information about each character
- **Fact-based question-answer pairs** that test memory retention and state reasoning

### 1. Midnight Diner

- **Languages**: Available in both **Chinese** and **English**
- **Contents**:
  - `script/`: Full episode script
  - `role cards/`: Character cards with descriptions (background, personality, relationships)
  - `QA_pair/`: Fact-based QA pairs extracted from each episode of the script

### 2. Hamlet 

**Note**: *To mitigate data leakage risks, all character names in the Hamlet dataset have been anonymized and replaced with neutral identifiers.*

- **Language**: English
- **Contents**:
  - `script/`: Full original play text (Shakespeare)
  - `role cards/`: Character cards with descriptions (background, personality, relationships)
  - `QA_pair/`: Fact-based QA pairs extracted from each episode of the script
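
To check that a dataset folder is complete before running anything on it, a small helper along the following lines may be useful. This is only a sketch: the subfolder names come from the lists above, but the function name `list_dataset_files` and the example path `Hamlet` are assumptions, and the sketch makes no assumption about the file formats inside each folder.

```python
from pathlib import Path

# Subfolder names taken from the dataset descriptions above.
SUBFOLDERS = ("script", "role cards", "QA_pair")


def list_dataset_files(dataset_root):
    """Map each expected subfolder to a sorted list of the files it contains.

    A subfolder that does not exist maps to an empty list, which makes an
    incomplete download easy to spot.
    """
    root = Path(dataset_root)
    listing = {}
    for name in SUBFOLDERS:
        folder = root / name
        if folder.is_dir():
            listing[name] = sorted(p for p in folder.rglob("*") if p.is_file())
        else:
            listing[name] = []
    return listing


if __name__ == "__main__":
    # "Hamlet" is a hypothetical path; point it at the actual dataset folder.
    for name, files in list_dataset_files("Hamlet").items():
        print(f"{name}: {len(files)} file(s)")
```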

## 🧪 Evaluation Script

We provide a reference evaluation script, `eval.py`, to assess how well models retain factual and state-based information after reading the script and role cards.
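
For a quick sanity check on fact-based answers, a simple normalized exact-match score can be computed as sketched below. This is not the scoring used by `eval.py`, which may work differently; the function names `normalize` and `exact_match_accuracy` are hypothetical, and the normalization only strips ASCII punctuation, so it is suited to English answers.

```python
import re
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    # Made-up toy strings, not taken from the datasets.
    preds = ["Blue.", "three days"]
    refs = ["blue", "two days"]
    print(exact_match_accuracy(preds, refs))  # prints 0.5
```

Exact match is strict: a correct answer phrased differently from the reference counts as a miss, so treat the number as a lower bound rather than a final score.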



## 📚 Citation

If you find our work helpful, please consider citing the following paper:

```
@article{fan2025if,
  title={If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs},
  author={Fan, Siqi and Huang, Xiusheng and Yao, Yiqun and Fang, Xuezhi and Liu, Kang and Han, Peng and Shang, Shuo and Sun, Aixin and Wang, Yequan},
  journal={arXiv preprint arXiv:2503.23514},
  year={2025}
}
```

